Pandas Merging¶

# Import pandas
import pandas as pd

user_usage = pd.read_csv("user_uses.csv")
user_device = pd.read_csv("user_device.csv")
devices = pd.read_csv("supported_devices.csv")
devices.rename(columns={"Retail Branding": "manufacturer"}, inplace=True)

user_device.head()

devices.head(5)

Merge user_usage with user_device.¶

We're trying to get the average usage for different types of devices. So we need to get the user's device code from user_usage as a column on user_usage, and then get the device's manufacturer from devices as a column on the result.

First, we merge user_usage with user_device with "use_id" as our common column

result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id')
result.head()

Left merge example¶

A left merge, or left join, between two dataframes keeps all the records from the left dataframe, in this case "user_usage". Rows from the right dataframe will be kept in the result only where there is a match in the merge variable in the right dataframe, and NaN values will be in the result where not.

result_left = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='left')
print("user_usage dimensions: {}".format(user_usage.shape))
print("result dimensions: {}".format(result.shape))
print("There are {} missing values in the result.".format(
        result['device'].isnull().sum()))

user_usage dimensions: (240, 4)
result dimensions: (240, 6)
There are 81 missing values in the result.

result_left.head()

Right merge:¶

A right merge, or right join, between two dataframes keeps all of the records from the right dataframe, in this case "user_device". Rows from the left dataframe will be kept where there is a match in the merge variable, and NaN values will be in the result where it's not.

result_right = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='right')
print("user_device dimensions: {}".format(user_device.shape))
print("result dimensions: {}".format(result.shape))
print("There are {} missing values in the 'monthly_mb' column in the result.".format(
        result['monthly_mb'].isnull().sum()))
print("There are {} missing values in the 'platform' column in the result.".format(
        result['platform'].isnull().sum()))

user_device dimensions: (272, 6)
result dimensions: (240, 6)
There are 0 missing values in the 'monthly_mb' column in the result.
There are 81 missing values in the 'platform' column in the result.

result_right.head()

Outer merge:¶

A full outer join, or outer merge, keeps all rows from the left and right dataframe in the result. Rows will be aligned where there is shared join values between the left and right. Rows with NaN values, in either the left-originating or right-originating columns will be. it will be left in the result where there is no shared join value.

print("There are {} unique values of use_id in our dataframes.".format(
        pd.concat([user_usage['use_id'], user_device['use_id']]).unique().shape[0]))

result_outer = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id', how='outer', indicator=True)

print("Outer merge result has {} rows.".format(result.shape))

print("There are {} rows with no missing values.".format(
    (result_outer.apply(lambda x: x.isnull().sum(), axis=1) == 0).sum()))

There are 353 unique values of use_id in our dataframes.
Outer merge result has (240, 6) rows.
There are 159 rows with no missing values.

result_outer.head()

Merging with Device manufacturer¶

# First merge user_usage and user_device to add the platform and device to the user usage.
result = pd.merge(user_usage,
                 user_device[['use_id', 'platform', 'device']],
                 on='use_id',
                 how='left')

# Now, based on the "device" column in result, match the "Model" column in devices.
devices.rename(columns={"Retail Branding": "manufacturer"}, inplace=True)
result = pd.merge(result, 
                  devices[['manufacturer', 'Model']],
                  left_on='device',
                  right_on='Model',
                  how='left')

result.sample(5)

# You can apply filter on dataframe according to your reuirement.
devices[devices.Device.str.startswith('GT')]

	use_id	user_id	platform	platform_version	device	use_type_id
0	22782	26980	ios	10.2	iPhone7,2	2
1	22783	29628	android	6.0	Nexus 5	3
2	22784	28473	android	5.1	SM-G903F	1
3	22785	15200	ios	10.2	iPhone7,2	3
4	22786	28239	android	6.0	ONE E1003	1

	manufacturer	Marketing Name	Device	Model
0	NaN	NaN	AD681H	Smartfren Andromax AD681H
1	1&1 10.or	1&1 Puck 10or_G2	FJL21\nhws7721g diw362_1u1 G2	FJL21\nMediaPad 7 Youth 2 DIW362P 1U1\n10or G2
2	10.or	D	10or_D	D
3	10.or	E	E	E
4	10.or	G	G	G

	outgoing_mins_per_month	outgoing_sms_per_month	monthly_mb	use_id	platform	device
0	21.97	4.82	1557.33	22787	android	GT-I9505
1	1710.08	136.88	7267.55	22788	android	SM-G930F
2	1710.08	136.88	7267.55	22789	android	SM-G930F
3	94.46	35.17	519.12	22790	android	D2303
4	71.59	79.26	1557.33	22792	android	SM-G361F

	outgoing_mins_per_month	outgoing_sms_per_month	monthly_mb	use_id	platform	device
0	21.97	4.82	1557.33	22787	android	GT-I9505
1	1710.08	136.88	7267.55	22788	android	SM-G930F
2	1710.08	136.88	7267.55	22789	android	SM-G930F
3	94.46	35.17	519.12	22790	android	D2303
4	71.59	79.26	1557.33	22792	android	SM-G361F

	outgoing_mins_per_month	outgoing_sms_per_month	monthly_mb	use_id	platform	device
0	21.97	4.82	1557.33	22787	android	GT-I9505
1	1710.08	136.88	7267.55	22788	android	SM-G930F
2	1710.08	136.88	7267.55	22789	android	SM-G930F
3	94.46	35.17	519.12	22790	android	D2303
4	71.59	79.26	1557.33	22792	android	SM-G361F

	outgoing_mins_per_month	outgoing_sms_per_month	monthly_mb	use_id	platform	device	manufacturer	Model
28	135.09	42.02	5191.12	22833	android	E6653	NaN	NaN
3	94.46	35.17	519.12	22790	android	D2303	NaN	NaN
38	1221.85	69.20	6229.33	22858	android	ONEPLUS A3003	NaN	NaN
52	128.34	1.77	7469.79	22881	android	HTC Desire 825	NaN	NaN
182	583.73	216.52	1441.29	23548	NaN	NaN	NaN	NaN

	manufacturer	Marketing Name	Device	Model
7	3Go	GT10K3IPS	GT10K3IPS	GT10K3IPS
8	3Go\n3Go 3Q	GT70053G\nGT7007EQC OC1020A	GT70053G\nGT7007EQC OC1020A	GT70053G\nGT7007EQC OC1020A